A Linear Time Biclustering Algorithm for Time Series Gene Expression Data

Authors

  • Sara C. Madeira
  • Arlindo L. Oliveira
Abstract

Non-supervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering [1] is to find subgroups of genes and subgroups of conditions in which the genes exhibit highly correlated behaviors. In the most common settings, biclustering is an NP-complete problem [3], and heuristic approaches are used to obtain sub-optimal solutions using reasonable computational resources [2]. There exists, however, one particular restriction of the problem that has not been considered before and that leads to a tractable problem and, indeed, to a surprisingly efficient linear-time algorithm for finding all maximal biclusters. This restriction is applicable when the gene expression data corresponds to snapshots in time of the expression levels of the genes. Under this experimental setup, the researcher is, in many cases, particularly interested in biclusters with contiguous columns, which correspond to samples taken at consecutive instants in time during which the genes exhibit coherent expression levels. Our algorithm is based on suffix trees, built over a set of strings obtained by first discretizing the values in the gene expression matrix and then performing an alphabet transformation that associates each symbol in the matrix with the column to which it belongs. The main result is that, if T is the generalized suffix tree built over these strings, then each internal node satisfying a particular condition, checkable in constant time, defines a maximal bicluster with contiguous columns. For example, consider the (hypothetical) discretized matrix shown on the left of the following figure:
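Independently of the figure referenced above, the alphabet transformation and the notion of a maximal bicluster with contiguous columns can be illustrated with a minimal Python sketch. This is not the authors' suffix-tree algorithm: it is a brute-force enumeration that runs in polynomial rather than linear time, and the example matrix, the symbols `U`/`D`/`N`, and the function names `alphabet_transform` and `contiguous_biclusters` are assumptions made purely for illustration.

```python
from collections import defaultdict


def alphabet_transform(matrix):
    """Append the 1-based column index to every symbol of every row."""
    return [[f"{sym}{j + 1}" for j, sym in enumerate(row)] for row in matrix]


def contiguous_biclusters(matrix, min_rows=2):
    """Brute-force search for maximal biclusters with contiguous columns.

    Returns tuples (row_indices, first_col, last_col, pattern) with
    1-based column numbers. Illustrative only: the suffix-tree approach
    described in the abstract avoids this exhaustive enumeration.
    """
    rows = alphabet_transform(matrix)
    n_cols = len(rows[0]) if rows else 0
    found = []
    for start in range(n_cols):
        for end in range(start, n_cols):
            # Group rows sharing the same pattern on columns start..end,
            # so every group is already maximal in its rows.
            groups = defaultdict(list)
            for i, row in enumerate(rows):
                groups[tuple(row[start:end + 1])].append(i)
            for pattern, members in groups.items():
                if len(members) < min_rows:
                    continue
                # The group is not column-maximal if all its rows also
                # agree on the column just before or just after the range.
                extends_left = start > 0 and len(
                    {rows[i][start - 1] for i in members}) == 1
                extends_right = end < n_cols - 1 and len(
                    {rows[i][end + 1] for i in members}) == 1
                if not extends_left and not extends_right:
                    found.append((members, start + 1, end + 1, pattern))
    return found


# Made-up discretized matrix: 4 genes (rows) by 4 time points (columns).
matrix = ["NUDU", "DUDU", "NNNU", "UUDU"]
for genes, first, last, pattern in contiguous_biclusters(matrix):
    print(genes, f"columns {first}-{last}", "".join(pattern))
```

On this made-up matrix the sketch reports, among others, genes 0, 1 and 3 sharing the pattern `U2 D3 U4` on columns 2 to 4. Because each symbol carries its column index, a shared substring of the transformed strings can only come from the same columns, which is what allows a generalized suffix tree over those strings to represent such biclusters as internal nodes.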

Similar resources

Efficient Biclustering Algorithms for Identifying Transcriptional Regulation Relationships Using Time Series Gene Expression Data

Biclustering algorithms have been shown to be remarkably effective in a variety of applications. Although the biclustering problem is known to be NP-complete, in the particular case of time series gene expression data analysis, efficient and complete biclustering algorithms are known and have been used to identify biologically relevant expression patterns. However, these algorithms, namely CCC-Bicl...

Ccc-bicluster Analysis for Time Series Gene Expression Data

Many biclustering problems have been shown to be NP-complete. However, when the goal is to identify biclusters in time series expression data, the problem can be restricted to finding only maximal biclusters with contiguous columns. This restriction leads to a tractable problem. Its motivation is the fact that biological processes start and conclude in an identifiable contiguous p...

e-CCC-Biclustering: Related work on biclustering algorithms for time series gene expression data

This document provides supplementary material describing related work on biclustering algorithms for time series gene expression data analysis. We describe in detail three state-of-the-art biclustering approaches specifically designed to discover biclusters in gene expression time series and identify their strengths and weaknesses.

Prognostic Prediction through Biclustering-Based Classification of Clinical Gene Expression Time Series

The constant drive towards a more personalized medicine has led to an increasing interest in temporal gene expression analyses. It is now broadly accepted that considering a temporal perspective represents a great advantage to better understand disease progression and treatment results at a molecular level. In this context, biclustering algorithms emerged as an important tool to discover local expre...

An Efficient Biclustering Algorithm for Finding Genes with Similar Patterns in Time-series Expression Data

Biclustering algorithms have emerged as an important tool for the discovery of local patterns in gene expression data. For the case where the expression data corresponds to time-series, efficient algorithms that work with a discretized version of the expression matrix are known. However, these algorithms assume that the biclusters to be found are perfect, in the sense that each gene in the bicl...

An Evaluation of Discretization Methods for Non-Supervised Analysis of Time-Series Gene Expression Data

Gene expression data has been extensively analyzed using non-supervised machine learning algorithms, with the objective of extracting potential relationships between genes. Many of these algorithms work with discretized versions of the expression data. However, the many possible methods that can be used to discretize the data have not been comprehensively studied. In this paper, we describe a n...

Publication date: 2005